library(openintro)
library(plotly)
# holds all variables and functions responsible for loading and saving datasets
source("hospitalization_vaccination_data_loading.R")This is a study of vaccination rate and hospital bed usage in the United States inspired by the Washington Post article Mapping America’s hospitalization and vaccination divide. In this project we will be recreating the USA map found in the article and adding interactivity so different dates and variables can be selected.
Washington Post Bivariate Choropleth Map
Vaccination data is provided at the county level by the CDC and hospital bed usage data is provided by HealthData.gov. These variables are visualized with a bivariate choropleth map of Hospital Referral Regions in the United States.
Vaccination Rate is defined by county and HRRs are defined by zip code. We can’t use county data to calculate the vaccination rate of an HRR because these regions overlap. Zip codes in one HRR can live in different counties, and Zip codes in different counties can live in the same HRR.
We need to know both the population of each HRR and the vaccination rate of each part of that population.
We determine the vaccination rate of the HHRs by averaging the vaccination rate (given by county) of the zip codes in that HRR. The individual zip code’s vaccination rate needs to be weighted by that zip code’s population. Population data is obtained from the United States Census Bureau, which is unfortunately not counted by zip code, but by blocks that make up the congressional districts. To estimate the population of zip codes, we will use the 2010 census zip code tabulation records, which approximate the zip codes in which the congressional district blocks lay.
To find all zip codes in an HRR we will use a Zip Code to HRR crosswalk.
Anytime a new file is downloaded, that file is cached in the “cached-data” folder. Anytime a request is made for a dataset by date, this folder is checked first.
All data sources are downloaded from the internet. Only the needed portions of the datasets are requested from the corresponding endpoints. Before downloading, this application first checks if the dataset has already been downloaded in a local cache file. Whenever a new portion of a dataset is downloaded, it is saved to the cache folder local to the application folder.
Vaccination rates are obtained from the CDC: https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-County/8xkx-amqh. This dataset is large, so instead of downloading the whole dataset, this application accesses it through the SODA API and retrieves only the relevant rows and columns.
The “COVID-19 Vaccinations in the United States,County” data provides counts and percentages of people who have been vaccinated in each county of the United States.
The variables retrieved are:
Hospital bed usage counts is obtained from HealthData.gov: https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/anag-cw7u. This dataset is large, so instead of downloading the whole dataset, this application accesses it through the SODA API and retrieves only the relevant rows and columns.
The “COVID-19 Reported Patient Impact and Hospital Capacity by Facility” data provides counts on hospital bed utilization that is aggregated weekly.
The variables retrieved are:
Inpatient bed counts are used instead of total bed counts because many hospitals that only have inpatient bed data do not include data for inpatient and outpatient totals.
We also calculate:
The hospital bed dataset contains records for each hospital. For mapping, each map region will need to represent the average of all hospitals present in that region.
Population data is obtained from the United States Census Bureau using their 2010 ZCTA to County Relationship File (zcta_county_rel_10.txt)
download: https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt column descriptions: https://www.census.gov/programs-surveys/geography/technical-documentation/records-layout/2010-zcta-record-layout.html#par_textimage_0
County shape data is obtained from the R package
Albersusa.
Shape data for USA HRR regions are downloaded from arcgis.com using their “FeatureServer” REST api.
https://www.arcgis.com/home/item.html?id=46bf6790c4e0455e9379ee9769b1a5ab
This crosswalk is obtained from Dartmouth Atlas as a zip file.
Vaccination data and hospital bed usage data are loaded during the call to the graphing function so that only the date needed is loaded into memory.
# Setup the data caching folder
create_data_folder_if_dne()## [1] "Founds cached data folder."
# get county shapes for mapping
us_county_shape_data <- counties_sf("laea") %>%
mutate(fips = as.numeric(as.character(fips)))## old-style crs object detected; please recreate object with a recent sf::st_crs()
# get state shapes for mapping
us_state_shape_data <- usa_sf("laea")## old-style crs object detected; please recreate object with a recent sf::st_crs()
# Get hrr shapes for mapping
hrr_shape_data <- get_hrr_shapes_arcgis()## Reading layer `hhr_shapes_arcgis_data' from data source
## `C:\Users\Cinde\source\Intro to Data Science\Covid_Hospitalization_and_Vaccination\cached-data\hhr_shapes_arcgis_data.shape'
## using driver `GeoJSON'
## Simple feature collection with 304 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7319 ymin: 24.84407 xmax: -66.95047 ymax: 49.38309
## Geodetic CRS: WGS 84
## [1] "HRR shapes data loaded from local cache."
# Get Required Crosswalks
zip_hrr_crosswalk_data <- get_hrr_zip_crosswalk()## [1] "Loaded zip hrr crosswalk data from local cache."
# get population census data
zcta_population_data <- get_zcta_population_data()## [1] "Loaded ZCTA codes and population data from local cache."
hrr_to_state <- hrr_shape_data %>%
st_drop_geometry() %>%
select(HRRNUM, HRRSTATE_long)A dataset is needed that tracks the population percentages of each ZCTA in each county for the function that calculates vaccination rates of HRRs
calculate_hrr_population_zip_slices <- function(){
# get population totals of hrr
hrr_population_zip_slice = zip_hrr_crosswalk_data %>%
left_join(
zcta_population_data,
by = c("zipcode19" = "ZCTA5")) %>%
group_by(hrrnum) %>%
summarise(population = sum(POPPT, na.rm = T), zip_count = n())
## check how well the join worked ##
total_zcta_pop = sum(zcta_population_data$POPPT, na.rm = T)
us_territories_pop_wiki = 3623895
usa_zcta_pop = total_zcta_pop - us_territories_pop_wiki
total_hrr_pop = sum(hrr_population_zip_slice
$population, na.rm = T)
percent_retained = format(total_hrr_pop / total_zcta_pop * 100, digits = 4)
number_lost = format(total_zcta_pop - total_hrr_pop, big.mark=",")
print(paste(
"After joining US state zip codes with ZCTAs, We keep about",
percent_retained,
"% of the population, loosing",
number_lost, "people." ))
print(paste(
"zctas also account for the population in us territories, which have a total population of",
format(us_territories_pop_wiki, big.mark = ","), "according to wikipedia."))
print(paste(
"This pushes the retained popoulation of the states closer to",
format(total_hrr_pop / usa_zcta_pop * 100, digits = 4), "percent" ))
return(hrr_population_zip_slice)
}
hrr_population_zip_slice <- calculate_hrr_population_zip_slices()## [1] "After joining US state zip codes with ZCTAs, We keep about 98.8 % of the population, loosing 3,738,434 people."
## [1] "zctas also account for the population in us territories, which have a total population of 3,623,895 according to wikipedia."
## [1] "This pushes the retained popoulation of the states closer to 99.96 percent"
We are using population data for zip code tabulation areas because we don’t have counts for zip codes. ZCTAs will be our proxy for zip codes. Because zcta’s don’t line up exactly with zip codes, the same zcta can be part of more than one county. That is why it is important to know how much of the ZCTA’s population is in each county.
The Process:
calculate_hrr_vaccination_rates <- function(date) {
vaccination_data = get_vaccination_rates_data(date)
zip_hrr_vaccination_population_join = vaccination_data %>%
inner_join(zcta_population_data, by = c("fips" = "GEOID")) %>%
inner_join(zip_hrr_crosswalk_data, by = c("ZCTA5" = "zipcode19")) %>%
select(ZCTA5, fips, POPPT, series_complete_pop_pct, administered_dose1_pop_pct, hrrnum) %>%
rename(
zcta = ZCTA5,
zip_pop_slice_in_fips = POPPT,
fips_vaccination_percent = series_complete_pop_pct,
fips_one_dose_percent = administered_dose1_pop_pct) %>%
inner_join(hrr_population_zip_slice, by = "hrrnum") %>%
rename(total_hrr_population = population)
hrr_vacc_percent = zip_hrr_vaccination_population_join %>%
mutate(
num_vacc_in_zip_slice = fips_vaccination_percent * zip_pop_slice_in_fips, # calculate number in zip of fully vaccinated
num_one_dose_in_zip_slice = fips_one_dose_percent * zip_pop_slice_in_fips) %>% # calculate number in zip of one dose
mutate(
zip_slice_weighted_vacc_percent = num_vacc_in_zip_slice / total_hrr_population, # calculate weighted portion of fully vaccinated
zip_slice_weighted_single_dose_percent = num_one_dose_in_zip_slice / total_hrr_population) %>% # calculate weighted portion of single dose
select(hrrnum, zip_slice_weighted_vacc_percent, zip_slice_weighted_single_dose_percent) %>%
group_by(hrrnum) %>%
summarise(
vacc_complete_percent = sum(zip_slice_weighted_vacc_percent),
single_dose_percent = sum(zip_slice_weighted_single_dose_percent))
}These functions retrieve data for vaccination rates and hospital bed usage from their respective APIs
possible_axis_labels <- c(
"vacc_complete_percent" = "Population Fully Vaccinated",
"single_dose_percent" = "Population With Single Dose",
"bed_usage_ratio" = "Hospital Beds Used",
"covid_bed_usage_ratio" = "Hospital Beds Used for Covid",
"covid_bed_usage_total_bed_usage_ratio" = "Covid Bed to Total Bed Usage")
### Method for selecting appropriate dates for data
valid_vaccination_dates <- read_csv("valid_vaccination_dates.csv", show_col_types = F)$Date
valid_bed_usage_dates <- read_csv("valid_bed_usage_dates.csv", show_col_types = F)$collection_week
closest_valid_dates <- function(date, list) {
closest_index = which.min(abs(ymd(date) - valid_vaccination_dates))
vac_date = valid_vaccination_dates[closest_index]
closest_index = which.min(abs(ymd(date) - valid_bed_usage_dates))
bed_date = valid_bed_usage_dates[closest_index]
return(c(vac_date, bed_date))
}
graph_interactive <- function(graph){
ggplotly(graph, tooltip = "text") %>%
style(hoveron = "fill")
}### Vaccination Rate by HHR Graphing Function
### Stat Selectable Interactive Graph of Vaccination Rates by HRR
#
# Values for `x_axis` and `y_axis`:
# vacc_complete_percent: Percentage of people fully vaccinated in that HRR
# single_dose_percent: Percentage of people with one vaccine dose in that HRR
# bed_usage_ratio: Percentage of hospital beds used in HRR
# covid_bed_usage_ratio: Percentage of beds used for COVID in HRR
# covid_bed_usage_total_bed_usage_ratio: Ratio of COVID bed usage to total bed usage
#
# Date must be given in yyyymmdd format
Generate_Vaccination_Plot <- function(date, x_axis, y_axis){
x_axis_stat = rlang::parse_expr(x_axis)
y_axis_stat = rlang::parse_expr(y_axis)
x_axis_label = paste0(possible_axis_labels[x_axis])
y_axis_label = paste0(possible_axis_labels[y_axis])
dates = closest_valid_dates(ymd(date))
# Get Vaccination Data
vaccination_data_per_hrr = calculate_hrr_vaccination_rates(dates[1])
## Get Hospital Data
hospital_bed_data_per_hospital = get_bed_utilization_data(dates[2]) %>% # don't include hospitals with 0 beds
filter(inpatient_beds_7_day_avg > 1, state != "TX") # TX doesn't have vaccination data
# calculate bed usage ratios
hospital_bed_data_per_hospital = hospital_bed_data_per_hospital %>%
mutate(
bed_usage_ratio = inpatient_beds_used_7_day_avg / inpatient_beds_7_day_avg * 100,
covid_bed_usage_ratio = inpatient_beds_used_covid_7_day_avg / inpatient_beds_7_day_avg * 100,
covid_bed_usage_total_bed_usage_ratio = inpatient_beds_used_covid_7_day_avg / inpatient_beds_used_7_day_avg
)
# Group Hospitals in HRR
bed_ratios_per_hrr = hospital_bed_data_per_hospital %>%
left_join(zip_hrr_crosswalk_data, by = c("zip" = "zipcode19")) %>%
group_by(hrrnum) %>%
summarise(
bed_usage_ratio = mean(bed_usage_ratio, na.rm = TRUE),
covid_bed_usage_ratio = mean(covid_bed_usage_ratio, na.rm = TRUE),
covid_bed_usage_total_bed_usage_ratio = mean(covid_bed_usage_total_bed_usage_ratio, na.rm = TRUE))
# Combine Vaccination and Bed Data (and hrr/state crosswalk)
bed_utilization_vaccination_data_hrr = bed_ratios_per_hrr %>%
left_join(vaccination_data_per_hrr, by = "hrrnum") %>%
left_join(hrr_population_zip_slice, by = "hrrnum") %>%
left_join(hrr_to_state, by = c("hrrnum" = "HRRNUM"))
plot_data = bed_utilization_vaccination_data_hrr %>%
drop_na() %>% #drop unplottable rows
mutate(text = paste0(
"HRR #: ", hrrnum,
"</b>\nState: ", HRRSTATE_long,
"</b>\nFully Vaccinated: ", format(vacc_complete_percent, digits = 2), "%",
"</b>\nHad Single Dose: ", format(single_dose_percent, digits = 1), "%",
"</b>\nZip Code Count: ", zip_count,
"</b>\nHRR Pop: ", format(population, big.mark = ","))) %>%
mutate(vacc_complete_percent = vacc_complete_percent / 100,
single_dose_percent = single_dose_percent/100,
covid_bed_usage_ratio = covid_bed_usage_ratio/100,
bed_usage_ratio = bed_usage_ratio/100) %>%
ggplot(aes(x = !!x_axis_stat, y = !!y_axis_stat)) +
geom_point(color = "black", size = 1.4) +
geom_point(aes(color = population, text=text), size = 1.2) +
scale_color_continuous(
"HRR Population",
trans = "log10",
type = "gradient",
labels = scales::comma,
low = "blue",
high = "gold") +
geom_smooth(method = lm, se = F, linewidth = 0.5) +
scale_x_continuous(x_axis_label, labels=scales::percent) +
scale_y_continuous(y_axis_label, labels=scales::percent)
}`
graph_interactive <- function(graph){
ggplotly(graph, tooltip = "text") %>%
style(hoveron = "fills")
}
##### VACCINATION RATE
#####
# Date must be given in yyyymmdd format
Graph_Vaccination_Rates_By_County <- function(date) {
vaccination_data = get_vaccination_rates_data(date = date)
vaccination_data = us_county_shape_data %>%
left_join(vaccination_data, by = "fips")
vaccination_data %>%
ggplot() +
geom_sf(aes(fill = series_complete_pop_pct/100)) +
scale_fill_continuous("Fully Vaccinated", low="red", high="yellow", labels = scales::percent) +
ggtitle("COVID Vaccination Status", subtitle = "Percentage of county population that is fully vaccinated (from CDC)") +
my_map_theme()
}
# Date must be given in yyyymmdd format
Graph_Vaccination_Rates_By_Hrr <- function(date) {
vaccination_data <- calculate_hrr_vaccination_rates(date)
hrr_ggplot_data = hrr_shape_data %>%
left_join(vaccination_data, by = c("HRRNUM" = "hrrnum")) %>%
select(!HRR) %>%
st_transform(crs= "EPSG:2163") %>%
mutate(text = paste0(
"State: ", HRRSTATE_long,
"</b>\nFully Vaccinated: ", format(vacc_complete_percent, digits = 4), "%",
"</b>\nHad Single Dose: ", format(single_dose_percent, digits = 4), "%",
"</b>\nHRR #: ", HRRNUM,
"</b>\nZip Code Count: ", hrr_population_zip_slice$zip_count[HRRNUM],
"</b>\nHRR Pop: ", hrr_population_zip_slice$population[HRRNUM]))
hrr_ggplot_data %>%
ggplot() +
geom_sf(
aes(fill = vacc_complete_percent/100 + runif(nrow(hrr_ggplot_data), min=0, max=0.001),
text=text),
linewidth = 0.1,
color=alpha("black",0.5)) +
# geom_sf(data = us_state_shape_data, fill = alpha("black", 0.0)) +
scale_fill_continuous("Fully Vaccinated", low="red", high="yellow", labels = scales::percent) +
ggtitle("COVID Vaccination Status", subtitle = "Percentage of county population that is fully vaccinated (from CDC)") +
my_map_theme()
}
##### HOSPITAL BED USAGE
#####
# Date must be given in yyyymmdd format
Graph_Hospital_Bed_Usage_By_County <- function(date){
hospital_bed_data = get_bed_utilization_data(date = date)
# calculate ratios
hospital_bed_data = hospital_bed_data %>%
mutate(
bed_usage_ratio = inpatient_beds_used_7_day_avg / inpatient_beds_7_day_avg * 100,
covid_bed_usage_ratio = inpatient_beds_used_covid_7_day_avg / inpatient_beds_7_day_avg * 100,
covid_bed_usage_total_bed_usage_ratio = inpatient_beds_used_covid_7_day_avg / inpatient_beds_used_7_day_avg
)
county_bed_ratios = hospital_bed_data %>%
group_by(fips_code) %>%
summarise(
bed_usage_ratio = mean(bed_usage_ratio, na.rm = TRUE),
covid_bed_usage_ratio = mean(covid_bed_usage_ratio, na.rm = TRUE),
covid_bed_usage_total_bed_usage_ratio = mean(covid_bed_usage_total_bed_usage_ratio, na.rm = TRUE))
us_county_shape_data %>%
left_join(county_bed_ratios, by = c("fips" = "fips_code")) %>%
ggplot() +
geom_sf(aes(fill = covid_bed_usage_ratio/100)) +
scale_fill_continuous("Beds Used for Covid", low="deepskyblue", high="green", labels = scales::percent) +
ggtitle("Hospital Bed Usage", subtitle = "Percentage of county's total hospital beds used by covid patients") +
my_map_theme()
}
#####
# Date must be given in yyyymmdd format
Graph_Hospital_Bed_Usage_By_HRR <- function(date){
hospital_bed_data = get_bed_utilization_data(date)
# calculate ratios
hospital_bed_data = hospital_bed_data %>%
mutate(
bed_usage_ratio = inpatient_beds_used_7_day_avg / inpatient_beds_7_day_avg * 100,
covid_bed_usage_ratio = inpatient_beds_used_covid_7_day_avg / inpatient_beds_7_day_avg * 100,
covid_bed_usage_total_bed_usage_ratio = inpatient_beds_used_covid_7_day_avg / inpatient_beds_used_7_day_avg
)
hrr_bed_ratios = hospital_bed_data %>%
left_join(zip_hrr_crosswalk_data, by = c("zip" = "zipcode19")) %>%
group_by(hrrnum) %>%
summarise(
bed_usage_ratio = mean(bed_usage_ratio, na.rm = TRUE),
covid_bed_usage_ratio = mean(covid_bed_usage_ratio, na.rm = TRUE),
covid_bed_usage_total_bed_usage_ratio = mean(covid_bed_usage_total_bed_usage_ratio, na.rm = TRUE))
hrr_ggplot_data = hrr_shape_data %>%
left_join(hrr_bed_ratios, by = c("HRRNUM" = "hrrnum")) %>%
select(!HRR) %>%
st_transform(crs= "EPSG:2163")
hrr_ggplot_data %>%
ggplot() +
geom_sf(aes(fill = covid_bed_usage_ratio/100), linewidth = 0, color=alpha("black",0.02)) +
geom_sf(data = us_state_shape_data, fill = alpha("black", 0.0)) +
scale_fill_continuous("Beds Used for Covid", low="purple", high="orange", labels = scales::percent) +
ggtitle("Hospital Bed Usage", subtitle = "Percentage of hospital referral region's beds used by covid patients") +
my_map_theme()
}date = "2021-09-23"
plot_data = Generate_Vaccination_Plot(date, "vacc_complete_percent", "covid_bed_usage_ratio")## [1] "Vaccination records loaded from local cache."
## [1] "Hospital bed data loaded from local cache."
## Warning in geom_point(aes(color = population, text = text), size = 1.2):
## Ignoring unknown aesthetics: text
graph_interactive(plot_data)## `geom_smooth()` using formula = 'y ~ x'
plot_data = Generate_Vaccination_Plot(date, "covid_bed_usage_ratio", "vacc_complete_percent")## [1] "Vaccination records loaded from local cache."
## [1] "Hospital bed data loaded from local cache."
## Warning in geom_point(aes(color = population, text = text), size = 1.2):
## Ignoring unknown aesthetics: text
graph_interactive(plot_data)## `geom_smooth()` using formula = 'y ~ x'
plot_data = Generate_Vaccination_Plot(date, "single_dose_percent", "covid_bed_usage_total_bed_usage_ratio")## [1] "Vaccination records loaded from local cache."
## [1] "Hospital bed data loaded from local cache."
## Warning in geom_point(aes(color = population, text = text), size = 1.2):
## Ignoring unknown aesthetics: text
graph_interactive(plot_data)## `geom_smooth()` using formula = 'y ~ x'
Graph_Vaccination_Rates_By_County("2021/03/1")## [1] "Vaccination records loaded from local cache."
Graph_Vaccination_Rates_By_County("2021/09/24")## [1] "Vaccination records loaded from local cache."
graph_interactive(Graph_Vaccination_Rates_By_Hrr("2021/09/24"))## [1] "Vaccination records loaded from local cache."
## Warning in layer_sf(geom = GeomSf, data = data, mapping = mapping, stat =
## stat, : Ignoring unknown aesthetics: text